Abstract

Most standard approaches to the static analysis of programs,
such as the popular worklist method, are first-order methods
that inductively annotate program points with abstract values.
In this paper we introduce a second-order approach based on
Kleene algebra. In this approach, the primary objects of
interest are not the abstract data values, but the transfer
functions that manipulate them. These elements form a
left-handed Kleene algebra. The dataflow labeling is not
achieved by inductively labeling the program with abstract
values, but rather by computing the star (Kleene closure) of a
matrix of transfer functions. In this paper we introduce the
method and prove soundness and completeness with respect to the
standard worklist algorithm.


1  Introduction 

Dataflow analysis and abstract interpretation are concerned
with the static derivation of information about the execution
state at various points in a program. There is typically a
semilattice L of types or abstract values, each describing a
larger set of possible runtime values. The natural partial
order associated with L is the subtype relation. The objective
of the analysis is to associate an element of L with each point
of the program that represents what is known about the program
state whenever control passes through that point. This may
consist of type information, bounds on values of variables or
registers, operand stack depth, the shape of data structures,
whether pointers are null, etc. It is usually only an
approximation; the lower in the semilattice, the better the
approximation.

Each instruction has one or more associated transfer functions
f : L → L that describe how the state is transformed by the
instruction. The domain of f is determined by the type of the
instruction. Membership in the domain of f may be considered a
precondition for the safe execution of the instruction;
attempting to apply f to an element of L not in its domain
signals a type error. For example, an empty stack should not be
popped. Transfer functions may be composed, provided there is
no type mismatch. Transfer functions may be polymorphic, as for
example in the case of a swap operation that interchanges the
top two elements of a stack.

Each instruction has zero or more successors. Data manipulation
instructions such as loads, stores, and arithmetic operations
have the fallthrough instruction as successor. Conditional
jumps have the fallthrough as well as the jump target, and
unconditional jumps have only the jump target. Any instruction
that can raise an exception has the entry point of an exception
handler as a successor. Successors are determined solely by
statically available information. Thus the program can be
modeled by a directed control flow graph G whose nodes are the
instructions and whose edges go from each instruction to the
successors of that instruction. The edges are labeled with the
transfer functions associated with that instruction. In most
cases the transfer function is the same for all edges exiting a
particular node, but in some cases it is different. For
example, in Java bytecode, if an instruction throws an
exception, then the operand stack is cleared and the exception
object pushed onto the stack before invoking the handler. The
successor state corresponding to the exception handler thus
reflects a different stack configuration than that
corresponding to the fallthrough instruction.

The worklist algorithm for dataflow analysis is a standard
method for computing a least fixpoint labeling of the nodes of
G with elements of L [6]. It works as follows. First, the entry
point of the method is labeled with the element of L describing
the initial state of the computation. For example, in Java
bytecode, the initial label consists of an empty operand stack,
the types of the arguments to the method (including the object
itself if it is an instance method) in the first few local
variables, and a special undefined marker for the remaining
local variables. This node is marked as changed and placed on a
worklist. Then, as long as the worklist is nonempty, the
procedure repeatedly removes the next element s of the
worklist, and for each exiting edge (s, t), applies the
transfer function f associated with that edge to the label
x ∈ L of s to get f(x), then updates the label of the successor
t with f(x). If t is unlabeled, then it is labeled with f(x).
If t is already labeled, then it is relabeled with the join of
f(x) and its current label in the semilattice, if the join
exists. This indicates the best information that is known about
the program state at t from the various control flow paths into
t that have been analyzed so far. If the join of f(x) and the
current label of t does not exist in the semilattice, then it
is a type error. If t is successfully labeled and the new label
is different from the old, then t is marked as changed and
placed back on the worklist. When the worklist becomes empty,
the resulting labeling is the least fixpoint of a monotone
mapping on labelings defined in terms of the transfer
functions.
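The procedure just described can be sketched in a few lines of Python. This is only an illustrative sketch under simplifying assumptions, not the implementation of [6] or of any bytecode verifier: the names `worklist`, `succ`, `transfer`, and `add` are hypothetical, "unlabeled" is modeled by a bottom element, and type errors (joins that do not exist) are not modeled. The example lattice is finite sets of tags ordered by inclusion.

```python
from collections import deque

def worklist(succ, transfer, join, bottom, start, init):
    """Propagate labels along edges until nothing changes.

    succ[s]          -> list of successors t of node s
    transfer[(s, t)] -> function L -> L labeling the edge (s, t)
    join(x, y)       -> least upper bound in the semilattice
    bottom           -> the element standing for "unlabeled"
    """
    label = {s: bottom for s in succ}
    label[start] = init
    work = deque([start])
    while work:
        s = work.popleft()
        for t in succ[s]:
            new = join(label[t], transfer[(s, t)](label[s]))
            if new != label[t]:        # label of t changed: revisit t later
                label[t] = new
                work.append(t)
    return label

def add(tag):
    """Strict transfer function on sets: bottom stays bottom, else add a tag."""
    return lambda x: x | {tag} if x else x

# Hypothetical three-node graph with a loop between nodes 1 and 2.
succ = {0: [1], 1: [2], 2: [1]}
transfer = {(0, 1): add("a"), (1, 2): add("b"), (2, 1): add("c")}
labels = worklist(succ, transfer, lambda x, y: x | y,
                  frozenset(), 0, frozenset({"entry"}))
```

In this run node 1 is processed twice, because its label grows after the loop edge (2, 1) is analyzed; this is the kind of revisiting discussed next.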

One disadvantage of the worklist approach is that long paths in
the graph may be analyzed several times. For example, if a node
s is labeled with x ∈ L, then later revisited and relabeled
with y > x, then any long paths out of s may be traversed
again. The running time could be as bad as dn, where n is the
size of the program and d is the depth of the semilattice,
although this worst-case bound is probably rarely attained in
practice. Thus the worklist algorithm remains a popular method
for many practical program analysis tasks.

In this paper we describe an alternative approach that can be
used to avoid the recalculation of dataflow information along
long paths using a symbolic method based on Kleene algebra. The
elements of the algebra are transfer functions. The novelty of
this approach is that it is the transfer functions, not the
data values, that are the objects of primary algebraic
interest. The transfer functions are elements of a certain
algebraic structure called a left-handed Kleene algebra with
the operations of composition, join, and iteration. The control
flow graph of a method with n instructions gives rise to an
n × n matrix of transfer functions, and computing the star or
Kleene closure of this matrix amounts to computing the dataflow
information at all points of the program simultaneously.

In a companion paper [8], we describe a concrete application of
these ideas in the context of Java bytecode. That application
combines the second-order approach introduced in this paper
with the standard worklist algorithm to obtain a hybrid
algorithm with running time O(nm + m³), where m is the size of
a cutset (a set of nodes breaking all cycles in G). The
algorithm avoids recalculation of dataflow information along
long paths by computing the star (closure) of an m × m matrix
of transfer functions. This may give an improvement when m is
small compared to n.

In this paper, we lay the foundations of this approach and
prove correctness with respect to the standard worklist
algorithm [6]. This paper is organized as follows. In Section
2, we review the pertinent definitions of Kleene algebra and
left-handed Kleene algebra, including the formation of
matrices, and describe how transfer functions can be modeled as
strict monotone functions on a semilattice of abstract values
or types. In Section 3, we present an alternative approach to
static analysis based on computing the star or Kleene closure
of a matrix of transfer functions. Finally, in Section 4 we
prove that our second-order method produces the same final
dataflow labeling as the standard worklist algorithm on all
type-correct programs.

1.1  KAT  and  the  Static  Analysis  of  Programs 

In [10], we showed how Kleene algebra with tests (KAT), a
variant of KA that includes Boolean tests, can be used to
statically verify compliance with safety policies specified by
security automata, a general mechanism for the specification
and enforcement of a large class of safety policies [12]. We
proved the soundness and completeness of the method over
relational interpretations, and illustrated the method on an
example of [3] involving the verification of a device driver.

The results of [10] are very different from those of the
present paper. In that paper, the objective was to show how the
deductive system could be used as a mechanism to propagate
state information throughout the program. That was also a
first-order approach. The Boolean algebra and the deductive
system were essential components of that program. Here, we are
not restricting ourselves to Boolean information, and we are
not using the deductive system.

2  Background 

2.1  Upper  Semilattices 

An upper semilattice is a partially ordered set L in which
every finite set has a least upper bound, which must be unique.
The least upper bound of two elements x and y is denoted x + y.
The least upper bound of the null set is denoted ⊥. The
operation + is associative, commutative, and idempotent
(x + x = x), and x ≤ y iff x + y = y. The element ⊥ is the
least element of the semilattice and is an identity for +.

We also assume that there are no infinite ascending chains in
L; this is known as the ascending chain condition (ACC). It
follows from this assumption that there exists a maximum
element ⊤.

Intuitively, lower elements in the semilattice represent more
specific information, and the join operation represents
disjunction of information. For example, in the Java class
hierarchy, the join of String and StringBuffer is Object, their
least common ancestor in the hierarchy.
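As a small illustration of this join, the sketch below computes least common ancestors in a tree-shaped fragment of a class hierarchy. The `PARENT` table and the function names are hypothetical and cover only a handful of classes; interfaces, which make the real Java hierarchy more than a tree, are deliberately ignored.

```python
# Hypothetical fragment of a single-inheritance hierarchy (child -> parent).
PARENT = {
    "Object": None,
    "String": "Object",
    "StringBuffer": "Object",
    "Number": "Object",
    "Integer": "Number",
    "Long": "Number",
}

def ancestors(c):
    """Return the chain c, parent(c), ..., up to the root."""
    chain = []
    while c is not None:
        chain.append(c)
        c = PARENT[c]
    return chain

def join(x, y):
    """Least common ancestor of x and y, i.e. their join in the hierarchy."""
    seen = set(ancestors(x))
    for c in ancestors(y):          # walk upward from y; first hit is the lowest
        if c in seen:
            return c
    raise ValueError("no common ancestor")   # would correspond to a type error
```

With this table, `join("String", "StringBuffer")` yields `"Object"`, matching the example in the text, while `join("Integer", "Long")` yields the more specific `"Number"`.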

The element ⊤ represents a type error. In practice, any attempt
by a dataflow analysis computation to form a join x + y that
does not make sense indicates a fatal type error, and the
analysis will be aborted. We represent this situation
mathematically by x + y = ⊤.

The element ⊥ represents “unlabeled”. For example, the initial
labeling in the worklist algorithm is a map w_0 : V → L, where
V is the set of vertices of the control flow graph, such that
w_0(s_0) is the initial dataflow information available at the
start node s_0, and w_0(u) = ⊥ for all other nodes u ∈ V.

The ascending chain condition (ACC) is a standard assumption
that ensures that dataflow computations always converge.


2.2  Kleene  Algebra

Kleene algebra (KA) was introduced by S. C. Kleene [7] (see
also [4]). We define a Kleene algebra (KA) to be a structure
(K, +, ·, *, 0, 1), where (K, +, ·, 0, 1) is an idempotent
semiring, a*b is the least x such that b + ax ≤ x, and ba* is
the least x such that b + xa ≤ x. Here “least” refers to the
natural partial order a ≤ b ⇔ a + b = b. The operation + gives
the supremum with respect to ≤. This particular axiomatization
is from [9]. We normally omit the ·, writing ab for a · b. The
precedence of the operators is * > · > +. Thus a + bc* should
be parsed a + (b(c*)).

KA has a rich algebraic theory with many natural and useful
models: language-theoretic, relational, trace-based, matrix.
Standard models include the family of regular sets of strings
over a finite alphabet, the family of binary relations on a
set, and the family of n × n matrices over another Kleene
algebra. We refer the reader to [9] for further definitions and
basic results.

For this paper, we consider a weaker axiomatization. We will
assume that the algebra is left-distributive, but not
necessarily right-distributive. However, we will assume that it
is right-predistributive. That is, we assume that
ab + ac = a(b + c), but only ac + bc ≤ (a + b)c. Moreover, we
will not require that a*b be the least x such that b + ax ≤ x,
but only that ba* be the least x such that b + xa ≤ x. Such
algebras are called left-handed Kleene algebras.

The following summarizes the axioms of left-handed KA:

\[
\begin{array}{rcl@{\qquad}rcl}
a + (b + c) &=& (a + b) + c & a(bc) &=& (ab)c \\
a + b &=& b + a & 1a = a1 &=& a \\
a + 0 = a + a &=& a & 0a = a0 &=& 0 \\
ab + ac &=& a(b + c) & ac + bc &\le& (a + b)c
\end{array}
\]

and for the * operator,

\[ 1 + a^*a \le a^* \tag{1} \]
\[ b + xa \le x \;\Rightarrow\; ba^* \le x, \tag{2} \]

where ≤ refers to the natural partial order on K:

\[ a \le b \iff a + b = b. \]

Instead of (2), we might take the equivalent axiom

\[ xa \le x \;\Rightarrow\; xa^* \le x. \tag{3} \]

One can show that all the operations are monotone with respect
to ≤. The proof of monotonicity of multiplication does not need
distributivity, but only predistributivity. One can also show
that the inequality (1) can be strengthened to an equality.

2.3  Matrices


As mentioned, the n × n matrices over a Kleene algebra again
form a Kleene algebra under the appropriate definitions of the
operators. One can establish that matrices over a left-handed
algebra are left-handed, and those over a right-handed algebra
are right-handed. The proofs are symmetric. Unfortunately, the
proof given in [9] uses distributivity on both sides to show
only right-handedness, so technically it does not suffice to
establish the result we need. We therefore supply a proof here
for completeness. We show that 2 × 2 matrices over a
left-handed KA are left-handed; the result for general n
follows by induction as in [9]. We also show only (1) and (3);
the other laws are all straightforward.

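Define the matrix E* from E as follows:

\[
E = \begin{bmatrix} a & b \\ c & d \end{bmatrix}
\qquad
E^* = \begin{bmatrix} f^* & f^*bd^* \\ g^*ca^* & g^* \end{bmatrix}
\]

where f = a + bd*c and g = d + ca*b. The inequality (1) for E
is I + E*E ≤ E*, which reduces to the following inequalities
over K:

\begin{align*}
1 + f^*a + f^*bd^*c &\le f^* \\
f^*b + f^*bd^*d &\le f^*bd^* \\
g^*ca^*a + g^*c &\le g^*ca^* \\
1 + g^*ca^*b + g^*d &\le g^*.
\end{align*}

For the first two,

\begin{align*}
1 + f^*a + f^*bd^*c &= 1 + f^*(a + bd^*c) = 1 + f^*f \le f^*, \\
f^*b + f^*bd^*d &= f^*b(1 + d^*d) \le f^*bd^*,
\end{align*}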

and the other two are symmetric. To show (3) for E, we must
show that XE ≤ X implies XE* ≤ X. We can show this
independently for each row of X. This reduces to the task of
showing

\[ xf^* + yg^*ca^* \le x \tag{4} \]
\[ xf^*bd^* + yg^* \le y \tag{5} \]

under the assumptions

\[ xa + yc \le x \qquad xb + yd \le y. \]

By symmetry, we need only show (4). By the property (3) for K,
we have xa* ≤ x and yd* ≤ y, therefore

\[ xf = x(a + bd^*c) \le xa + xbd^*c \le xa + yd^*c \le xa + yc \le x. \]

It follows from (3) that xf* ≤ x. By a symmetric argument,
yg* ≤ y. Thus

\[ yg^*ca^* \le yca^* \le xa^* \le x. \]

These two inequalities establish (4), hence (3) for E.

In applications, we will be considering left-handed Kleene
algebras of monotone functions on a semilattice satisfying the
ascending chain condition.

2.4  Strict  Monotone  Functions  on  a  Semilattice 

We model transfer functions as strict monotone functions
f : L → L, where L is an upper semilattice satisfying the
ascending chain condition. The maps f must satisfy

\[ x \le y \;\Rightarrow\; f(x) \le f(y) \tag{6} \]
\[ f(\bot) = \bot. \tag{7} \]

There are particular strict monotone functions

\[ 0 = \lambda x.\bot \qquad 1 = \lambda x.x. \]

The domain of f is the set

\[ \operatorname{dom} f = \{x \in L \mid f(x) \ne \top\}. \]

The property (6) implies that dom f is closed downward under ≤.

Let K denote the family of strict monotone functions on L. We
can impose a left-handed Kleene algebra structure on K as
follows. First, define addition of functions pointwise:

\[ (f + g)(x) = f(x) + g(x) \]
\[ f \le g \iff f + g = g. \]

Under this definition, K forms an upper semilattice with least
element 0.

Elements of K can be composed using ordinary functional
composition. The operator is written · and the composition of f
followed by g is written fg; thus (fg)(x) = g(f(x)). Note that
x ∈ dom fg iff x ∈ dom f and f(x) ∈ dom g. The identity
function 1 is a two-sided identity for composition and 0 is a
two-sided annihilator.

Composition distributes over + on the left, but not necessarily
on the right. However, it is right-subdistributive due to
monotonicity. Thus K forms a left-handed idempotent semiring
under the operations +, ·, 0, 1.
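The asymmetry between the two distributive laws can be checked on a tiny example. The sketch below uses an assumed four-element diamond semilattice (⊥, two incomparable elements, ⊤) encoded as sets under union; all names (`plus`, `seq`, `const`, `h`) are illustrative choices, not notation from the paper.

```python
# Diamond semilattice: bottom = {}, two incomparable atoms, top = {a, b}.
BOT = frozenset()
A = frozenset({"a"})
B = frozenset({"b"})
TOP = A | B
L = [BOT, A, B, TOP]

def join(x, y):
    return x | y

def leq(x, y):
    return join(x, y) == y

def plus(f, g):
    """Pointwise join of two functions."""
    return lambda x: join(f(x), g(x))

def seq(f, g):
    """Composition 'f followed by g', written fg in the text."""
    return lambda x: g(f(x))

def const(c):
    """Strict almost-constant function: bottom -> bottom, anything else -> c."""
    return lambda x: BOT if x == BOT else c

f, g = const(A), const(B)
h = lambda x: TOP if x == TOP else BOT     # strict and monotone

# Left distributivity h(f + g) = hf + hg holds at every point.
left_ok = all(seq(h, plus(f, g))(x) == plus(seq(h, f), seq(h, g))(x) for x in L)

# On the right only fh + gh <= (f + g)h holds, and it can be strict.
right_leq = all(leq(plus(seq(f, h), seq(g, h))(x), seq(plus(f, g), h)(x))
                for x in L)
strict_at_A = plus(seq(f, h), seq(g, h))(A) != seq(plus(f, g), h)(A)
```

Here h sends the join of the two incomparable elements to ⊤ but each of them separately to ⊥, which is exactly what breaks right distributivity while preserving the inequality.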

The element f* is defined as the function which on input x
gives the least y such that x + f(y) ≤ y. In symbols,

\[ f^*(x) = \mu y.\,(x + f(y) \le y), \]

where μ is the usual least-fixpoint operator. The least
fixpoint exists, since f is monotone and the ACC holds, so the
monotone sequence

\[ x,\quad x + f(x),\quad x + f(x + f(x)),\quad \ldots \]

converges after a finite number of steps. The number of steps
is not necessarily uniformly bounded in x, however; a
counterexample is given by the semilattice consisting of
N ∪ {∞} with min as join and the strict monotone function f
that on input x gives ∞ if x = ∞, x − 1 if x ≥ 1, and 0 if
x = 0.
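The iteration defining f* and the counterexample can be sketched as follows. The helper names `star`, `steps`, and `pred` are hypothetical, and the loop terminates only because the ACC is assumed to hold for the semilattice passed in.

```python
import math

def star(f, join):
    """Return f*, where f*(x) is the least y with x + f(y) <= y."""
    def f_star(x):
        y = x
        while True:
            nxt = join(x, f(y))     # next element of x, x + f(x), ...
            if nxt == y:
                return y
            y = nxt
    return f_star

def steps(f, join, x):
    """Number of iterations before the sequence starting at x stabilizes."""
    y, k = x, 0
    while join(x, f(y)) != y:
        y, k = join(x, f(y)), k + 1
    return k

# Counterexample from the text: N ∪ {∞} with min as join, so ∞ is bottom.
def pred(x):
    if x == math.inf:
        return math.inf             # strictness: bottom maps to bottom
    return x - 1 if x >= 1 else 0
```

Starting from a finite x the sequence descends x, x − 1, ..., 0, so `steps(pred, min, x)` equals x: every run is finite, but no single bound works for all inputs.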

To show (1), we need to show that 1 + f*f ≤ f*, or in other
words, for an arbitrary x, x + f(f*(x)) ≤ f*(x). But this is
true, since f*(x) is defined to be the least element with this
property.

Finally, to show (3), we need to show that gf ≤ g implies
gf* ≤ g, or in other words, f*(g(x)) ≤ g(x) whenever
f(g(x)) ≤ g(x). But if f(g(x)) ≤ g(x), then y = g(x) satisfies
g(x) + f(y) ≤ y, and f*(g(x)) is the least such element.

We have shown

Theorem 2.1 The structure (K, +, ·, *, 0, 1) is a left-handed
Kleene algebra.

3  A  Second-Order  Approach 

In this section we present a general second-order approach to
static analysis. The technique exploits the ability to compute
the Kleene algebra operations on transfer functions as defined
above.

We are given a program with n instructions, and we wish to
label the underlying control flow graph G of the program with
elements of the semilattice L. Let E be the n × n matrix with
rows and columns indexed by the vertices of G such that if
(s, t) is an edge of G, then E[s, t] is the transfer function
labeling the edge (s, t), and E[s, t] = 0 if (s, t) is not an
edge of G. This matrix is easily constructed in a single pass
through the program.

Recall from Section 2.3 that the n × n matrices over a
left-handed Kleene algebra again form a left-handed Kleene
algebra. We can thus speak of the matrix E*. The entry E*[u, v]
is the join of the compositions of transfer functions along all
paths from u to v. The desired fixpoint dataflow labeling at
any node u of G can be obtained by evaluating E*[s_0, u](ℓ_0),
where ℓ_0 ∈ L is the initial label of the start node s_0. Thus
the inductive labeling of the control flow graph is replaced
with the computation of the matrix E*.
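For a program with two instructions, the computation can be sketched directly from the 2 × 2 formula for E* in Section 2.3. This is a toy illustration under assumptions, not the implementation of [8]: transfer functions are Python callables on finite sets ordered by inclusion, and the names `star2x2`, `seq`, `plus`, `add`, and `ZERO` are hypothetical.

```python
def join(x, y):
    return x | y                       # finite sets ordered by inclusion

def plus(f, g):
    return lambda x: join(f(x), g(x))

def seq(*fs):
    """Composition in diagram order: seq(f, g)(x) = g(f(x))."""
    def run(x):
        for f in fs:
            x = f(x)
        return x
    return run

def star(f):
    def f_star(x):
        y = x
        while join(x, f(y)) != y:      # iterate x, x + f(x), ... to a fixpoint
            y = join(x, f(y))
        return y
    return f_star

ZERO = lambda x: frozenset()           # the function 0, sending everything to bottom

def star2x2(a, b, c, d):
    """E* for E = [[a, b], [c, d]] with f = a + b d* c and g = d + c a* b."""
    f = plus(a, seq(b, star(d), c))
    g = plus(d, seq(c, star(a), b))
    return [[star(f), seq(star(f), b, star(d))],
            [seq(star(g), c, star(a)), star(g)]]

def add(tag):
    """Strict transfer function: bottom stays bottom, otherwise add one tag."""
    return lambda x: x | {tag} if x else x

# Two-instruction loop: edge 0 -> 1 labeled b, edge 1 -> 0 labeled c.
b, c = add("b"), add("c")
E_star = star2x2(ZERO, b, c, ZERO)
l0 = frozenset({"entry"})
labels = [E_star[0][0](l0), E_star[0][1](l0)]
```

Both labels come out as {entry, b, c}, which is also what a worklist run on this two-node loop converges to. Note that `E_star` is built without reference to `l0`; the initial label is applied only at the end.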

A concrete implementation of this method is described in [8] in
the context of Java bytecode. In that implementation, E* is not
computed directly, but rather used in conjunction with the
worklist algorithm to obtain a hybrid method that uses matrix
closure on a small cutset to avoid recalculating dataflow
information along long paths in the control flow graph.

4  Soundness  and  Completeness 


We argue in this section that the second-order algorithm
proposed in Section 3 and the standard worklist algorithm
produce the same final dataflow labeling for any type-correct
program.

Let L be an upper semilattice satisfying the ACC as described
in Section 2.1, and let K be the left-handed Kleene algebra of
strict monotone functions L → L as described in Section 2.4.
Let G be a control flow graph with vertices V, n = |V|, start
node s_0 ∈ V, and edges labeled with transfer functions f ∈ K.
Let ℓ_0 ∈ L be the initial dataflow information at s_0.

Formally, the worklist algorithm computes a sequence of
labelings w_n : V → L, n ≥ 0, as follows. We start with the
initial labeling

\[
w_0(u) = \begin{cases} \ell_0, & \text{if } u = s_0 \\ \bot, & \text{otherwise.} \end{cases}
\]

At stage n, say we have constructed a labeling w_n. To get
w_{n+1}, we take the next edge (u, v) from the worklist, apply
the associated transfer function E[u, v] to the current label
w_n(u) of u, and update the label of v with that value. Thus

\[
w_{n+1}(t) = \begin{cases} E[u,v](w_n(u)) + w_n(v), & \text{if } t = v \\ w_n(t), & \text{if } t \ne v. \end{cases}
\]

The sequence w_0, w_1, ... is monotone and converges to a
fixpoint

\[ w^* = \sup_n w_n. \]

This labeling is the least labeling such that for all u
reachable from the start node s_0,

\[ E[u,v](w^*(u)) \le w^*(v) \tag{8} \]

[6]. For vertices u not reachable from s_0, the worklist
algorithm will never see them, but (8) will still hold for
those vertices, since w*(u) = ⊥ and E[u, v](⊥) = ⊥ ≤ w*(v).
Note that v may still be reachable from s_0, even if u is not.

To compare this algorithm to our second-order algorithm of
Section 3, it will be convenient to label vertices with certain
functions L → L instead of elements of L. We lift an element
x ∈ L to an almost-constant function x̂ ∈ K as follows:

\[
\hat{x}(y) = \begin{cases} x, & \text{if } y \ne \bot \\ \bot, & \text{otherwise.} \end{cases}
\]

Note that the lifting of ⊥ is 0. The value x can be recovered
from x̂ by applying x̂ to any element of L besides ⊥. The
advantage of using lifted values is that function application
becomes composition, which is a Kleene algebra operation:

\[ \widehat{f(x)} = \hat{x} f. \tag{9} \]

If w : V → L is a labeling, we can lift it to a second-order
labeling ŵ : V → K by taking

\[ \hat{w}[u] = \widehat{w(u)}. \]

For example, the lifted version of the initial labeling w_0 is

\[
\hat{w}_0[u] = \begin{cases} \hat{\ell}_0, & \text{if } u = s_0 \\ 0, & \text{otherwise.} \end{cases}
\]

Although both w and ŵ are functions on V, we write ŵ[u] with
square brackets because we will be regarding it as a row vector
of length n and using it in matrix-vector computations.

We are now ready to prove our main theorem.

Theorem 4.1 Let w* : V → L be the final dataflow labeling
produced by the worklist algorithm. Then for all u ∈ V,

\[ E^*[s_0, u](\ell_0) = w^*(u). \tag{10} \]

Proof. Let w_0 be the initial labeling, and consider the
matrix-vector product ŵ_0 E*, which is a row vector V → K. We
first show that

\[ \hat{w}_0 E^* = \hat{w}^*. \tag{11} \]

By (2), the left-hand side is the least solution of

\[ \hat{w}_0 + XE \le X, \tag{12} \]

so it suffices to show that ŵ* is as well, since the least
solution of (12) is unique. For all u ∈ V and x ≠ ⊥,

\[
\hat{w}_0[u](x) = w_0(u) \le \sup_n w_n(u) = w^*(u) = \hat{w}^*[u](x),
\]
therefore ŵ_0 ≤ ŵ*. Similarly,

\begin{align*}
(\hat{w}^* E)[u](x) &= \sum_v \hat{w}^*[v]\, E[v,u]\,(x) \\
&= \sum_v E[v,u](w^*(v)) \\
&\le w^*(u) \quad \text{by (8)} \\
&= \hat{w}^*[u](x).
\end{align*}

Since u and x ≠ ⊥ were arbitrary, ŵ*E ≤ ŵ*. Since both
ŵ_0 ≤ ŵ* and ŵ*E ≤ ŵ*, we have ŵ_0 + ŵ*E ≤ ŵ*, therefore ŵ* is
a solution to (12).

To show that it is the least solution, let X be any other
solution. Then ŵ_0 ≤ X. Reasoning inductively, suppose that
ŵ_n ≤ X. Let (s, t) be the edge selected by the worklist
algorithm at stage n. For x ≠ ⊥,

\begin{align*}
\hat{w}_{n+1}[u](x) &= w_{n+1}(u) \\
&= \begin{cases} E[v,u](w_n(v)) + w_n(u), & \text{if } (v,u) = (s,t) \\ w_n(u), & \text{otherwise} \end{cases} \\
&\le \sum_v E[v,u](w_n(v)) + w_n(u) \\
&= \sum_v \hat{w}_n[v]\, E[v,u]\,(x) + \hat{w}_n[u](x) \\
&= (\hat{w}_n E)[u](x) + \hat{w}_n[u](x) \\
&= (\hat{w}_n(E + 1))[u](x),
\end{align*}

therefore ŵ_{n+1} ≤ ŵ_n(E + 1). It follows that

\[
\hat{w}_{n+1} \le \hat{w}_n(E + 1) \le X(E + 1) \le XE + X \le X.
\]

Then ŵ* = sup_n ŵ_n ≤ X, so ŵ* is the least solution of (12).

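Finally, we show how (10) follows from (11). Let u ∈ V. Since
ŵ_0[v] = 0 for v ≠ s_0, we have

\begin{align*}
\hat{w}_0[s_0]\, E^*[s_0,u] &= \hat{w}_0[s_0]\, E^*[s_0,u] + \sum_{v \ne s_0} \hat{w}_0[v]\, E^*[v,u] \\
&= \sum_v \hat{w}_0[v]\, E^*[v,u] \\
&= (\hat{w}_0 E^*)[u] \\
&= \hat{w}^*[u] \quad \text{by (11)},
\end{align*}

thus for any x ∈ L − {⊥},

\begin{align*}
E^*[s_0,u](\ell_0) &= E^*[s_0,u](w_0(s_0)) \\
&= \hat{w}_0[s_0]\, E^*[s_0,u]\,(x) \\
&= \hat{w}^*[u](x) \\
&= w^*(u).
\end{align*}

□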
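The lifting used throughout the proof, and the identity (9) that turns function application into composition, can be checked mechanically on a small example. The sketch below assumes a four-element lattice of sets and a hypothetical strict transfer function `f`; the names `lift` and `seq` are illustrative.

```python
BOT = frozenset()

def lift(x):
    """Almost-constant function: bottom -> bottom, every other input -> x."""
    return lambda y: BOT if y == BOT else x

def seq(f, g):
    """Composition 'f followed by g'."""
    return lambda y: g(f(y))

def f(x):
    """A strict monotone transfer function on sets (hypothetical example)."""
    return x | {"t"} if x else x

elements = [BOT, frozenset({"p"}), frozenset({"q"}), frozenset({"p", "q"})]

# Identity (9): lifting f(x) gives the same function as lift(x) followed by f.
identity_9 = all(lift(f(x))(y) == seq(lift(x), f)(y)
                 for x in elements for y in elements)

# The original value is recovered by applying the lifted function off bottom.
recovered = all(lift(x)(frozenset({"p"})) == x for x in elements)
```

Strictness of f matters here: at input ⊥ the composite evaluates f(⊥), which must be ⊥ for the two sides to agree.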


4.1  Conclusions  and  Future  Work 

We would like to implement the hybrid algorithm described in
[8] and compare it experimentally to the standard worklist
algorithm as specified in the Java VM specification [11]. This
should not be difficult, since we already have an
implementation of the latter [2].

The efficacy of our hybrid algorithm depends on finding a small
cutset in the control flow graph; that is, a set of nodes
intersecting every directed cycle. Finding a minimum cutset is
known to be NP-complete, but solvable in polynomial time for
reducible graphs [5]. Flowgraphs of bytecode programs compiled
from Java source would ordinarily be reducible. In practice,
simply taking the set of all targets of back edges should give
a very small cutset.
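One way to collect the targets of back edges is a depth-first search that records every edge leading to a node still on the search stack. The sketch below is an assumption about how this could be done, not the procedure of [8]; the function name and the example graph are hypothetical, and the recursion is suitable only for small graphs.

```python
def back_edge_targets(succ, start):
    """Return the targets of back edges found by a DFS from start.

    Every directed cycle reachable from start contains a back edge, so the
    returned set intersects every such cycle.
    """
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {v: WHITE for v in succ}
    targets = set()

    def dfs(u):
        color[u] = GRAY
        for v in succ[u]:
            if color[v] == GRAY:       # v is on the current DFS path: back edge
                targets.add(v)
            elif color[v] == WHITE:
                dfs(v)
        color[u] = BLACK

    dfs(start)
    return targets

# Hypothetical flowgraph with an inner loop 1 -> 2 -> 1 and an outer loop via 3 -> 0.
succ = {0: [1], 1: [2], 2: [1, 3], 3: [0, 4], 4: []}
cutset = back_edge_targets(succ, 0)
```

On this graph the result is {0, 1}. It is a valid cutset but not a minimum one, since node 1 alone already lies on both cycles.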

To test this, we collected some rough empirical evidence from a
sample of Java bytecode programs. Of 537 programs analyzed, the
median cutset size as a percentage of total program size was
2.1%. For all except five programs, the cutset was less than 5%
of program size. The largest program analyzed was 2668
instructions with 5 cutpoints, or 0.2%. These are very
encouraging numbers indeed.

It is also apparent that our second-order method is amenable to
parallelization. The worklist method is inherently sequential,
since each application of a transfer function requires
knowledge of its inputs, whereas compositions can be computed
without knowing their inputs. This presents another intriguing
possibility that we would like to investigate.

Acknowledgments 

We are indebted to Stephen Chong, Andrew Myers, and Radu Rugina
for valuable discussions. This work was supported in part by
NSF grant CCR-0105586 and ONR Grant N00014-01-1-0968. The views
and conclusions contained herein are those of the authors and
should not be interpreted as necessarily representing the
official policies or endorsements, either expressed or implied,
of these organizations or the US Government.

References 

[1]  M. Abadi and R. Stata. A type system for Java bytecode
subroutines. In Proc. 25th Symp. Principles of Programming
Languages, pages 149-160. ACM SIGPLAN/SIGACT, January 1998.

[2]  F. Adelstein, D. Kozen, and M. Stillerman. Malicious code
detection for open firmware. In Proc. 18th Computer Security
Applications Conf. (ACSAC’02), pages 403-412, December 2002.

[3]  T. Ball and S. K. Rajamani. The SLAM project: Debugging
system software via static analysis. In Proc. Conf. Principles
of Programming Languages (POPL’02), pages 1-3. ACM, January
2002.

[4]  J. H. Conway. Regular Algebra and Finite Machines. Chapman
and Hall, London, 1971.

[5]  M. R. Garey and D. S. Johnson. Computers and
Intractability: A Guide to the Theory of NP-Completeness. W.H.
Freeman, 1979.

[6]  G. A. Kildall. A unified approach to global program
optimization. In Proc. Conf. Principles of Programming
Languages (POPL’73), pages 194-206. ACM, 1973.

[7]  S. C. Kleene. Representation of events in nerve nets and
finite automata. In C. E. Shannon and J. McCarthy, editors,
Automata Studies, pages 3-41. Princeton University Press,
Princeton, N.J., 1956.

[8]  L. Kot and D. Kozen. Kleene algebra and bytecode
verification. Technical Report 2004-1972, Computer Science
Department, Cornell University, December 2004.

[9]  D. Kozen. A completeness theorem for Kleene algebras and
the algebra of regular events. Infor. and Comput.,
110(2):366-390, May 1994.

[10]  D. Kozen. Kleene algebras with tests and the static
analysis of programs. Technical Report 2003-1915, Computer
Science Department, Cornell University, November 2003.

[11]  T. Lindholm and F. Yellin. The JAVA virtual machine
specification. Addison Wesley, 1996.

[12]  F. B. Schneider. Enforceable security policies. ACM
Trans. Information and System Security, 3(1):30-50, February
2000.


